Post-editing apparatus and method for correcting translation errors

ABSTRACT

A post-editing apparatus for correcting translation errors, includes: a translation error search unit for estimating translation errors using an error-specific language model suitable for a type of error desired to be estimated from translation result obtained using a translation system, and determining an order of correction of the translation errors; and a corrected word candidate generator for sequentially generating error-corrected word candidates for respective estimated translation errors on a basis of analysis of an original text of the translation system. The post-editing apparatus further includes a corrected word selector for selecting a final corrected word from among the error-corrected word candidates by using the error-specific language model suitable for the type of error desired to be corrected, and incorporating the final corrected word in the translation result, thus correcting the translation errors.

CROSS-REFERENCE(S) TO RELATED APPLICATION(S)

The present invention claims priority of Korean Patent Applications No. 10-2008-0120911, filed on Dec. 2, 2008, and No. 10-2009-0027750, filed on Mar. 31, 2009, which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to a post-editing technology for correcting translation errors of a machine translation system using language models specified for respective error types, and, more particularly, to a post-editing apparatus and method for correcting translation errors, which are suitable for the improvement of translation quality by designating priorities for error correction in conformity with the characteristics of translation errors and sequentially correcting translation errors according to the priorities using language models specified for the types of translation errors.

BACKGROUND OF THE INVENTION

The performance of a machine translation system which translates sentences of one language into another language has been continuously improving. However, such a machine translation system still makes many translation errors. In order to remove translation errors, the performance of relevant modules in a translation engine must be improved. However, this method is problematic in that, since the individual module causing the corresponding error must be directly corrected, a new translation module must be implemented for error correction when the development of the translation system has already been completed. In addition, such a method is problematic in that, since error correction in individual modules does not consider the whole generated sentence, there is a high probability that accurate translation is not performed and errors still remain, and in that various types of errors are not solved at once. Due to these problems, for the improvement of the performance of a machine translation system, post-editing of translation errors, capable of automatically correcting errors occurring in the final translation results by using a post-processing scheme, is useful.

Recently, many statistics-based machine translation systems have been developed, but they do not exhibit excellent performance in the case of language pairs such as Korean-English, which are quite different owing to the difference in their word order. In practice, commercialized machine translation systems are rule or pattern-based machine translation systems. One of the main characteristics of the translation results produced by the rule or pattern-based machine translation systems is that, in many cases, although the meaning of a translated sentence is correct, the translated sentence is not natural, or is awkward due to a grammatical error.

Meanwhile, a language model may be used to estimate errors of a machine translation system. Such language models are built in the form of a database (DB) of the probabilities of a sequence of specific words appearing in a large corpus. The language models are used as indices for appropriately used expressions of a target language in statistics-based machine translation. Therefore, the language models may provide a basis for automatically finding a portion in which errors have occurred by comparing a translation created by a machine translation system with the built language models and for accurately correcting that portion.

Errors of a machine translation system may be estimated using an n-gram language model, which is one type of the conventional basic language models. As n is increased, more surrounding context may be viewed from the language model, but model data insufficiency may occur. Further, based on a simple n-gram model, estimation of errors occurring in long-distance dependency is difficult. Moreover, since only the simple arrangement of words is considered while building the n-gram language model, unnecessary word sequences, i.e., erroneous word sequences such as noise, are recognized as correct word sequences, thus decreasing accuracy in error detection and correction.

Therefore, there is a need to build a new language model for post-editing capable of handling long-distance dependency and preventing noise from occurring in the language model.

Although several translation errors may coexist in one translated sentence, conventional post-editing systems for correcting translation errors do not consider the order in which the coexisting translation errors are processed. Therefore, in order to improve the overall correction performance of the language model-based post-editing system, a technique that first corrects the error having the higher priority, in consideration of the priorities of the coexisting errors, is required.

Furthermore, the existing post-editing system is configured in a loosely-coupled structure in which it is difficult for a post-editing system to refer to information analyzed and generated by the translation engine of a translation system which performs actual translation. However, better translation performance may be achieved if errors are corrected with reference to information about an analysis of original text or a translated text by using a rule or pattern-based translation engine.

SUMMARY OF THE INVENTION

It is, therefore, an object of the present invention to provide a post-editing apparatus and method for correcting translation errors in the final translation generated by a machine translation system by providing a scheme for automatically detecting the errors and for correcting the errors, thereby obtaining a high quality translation result.

Another object of the present invention is to provide a post-editing apparatus and method for correcting translation errors, which designate priorities for error correction in conformity with the characteristics of translation errors of a machine translation system and sequentially correct the translation errors according to the priorities by using language models specified for the types of translation errors, thus improving translation quality.

A further object of the present invention is to provide a post-editing apparatus and method for correcting translation errors, which can effectively identify mistranslation, designate priorities for correcting the mistranslation, and correct the mistranslation by using language models specified for the types of errors.

In accordance with one aspect of the present invention, there is provided a post-editing apparatus for correcting translation errors, including: a translation error search unit for estimating translation errors using an error-specific language model suitable for a type of error desired to be estimated from translation result obtained using a translation system, and determining an order of correction of the translation errors; a corrected word candidate generator for sequentially generating error-corrected word candidates for respective estimated translation errors on a basis of analysis of an original text of the translation system; and a corrected word selector for selecting a final corrected word from among the error-corrected word candidates by using the error-specific language model suitable for the type of error desired to be corrected, and incorporating the final corrected word in the translation result, thus correcting the translation errors.

In accordance with another aspect of the present invention, there is provided a post-editing method for correcting translation errors, including: estimating translation errors using an error-specific language model suitable for a type of error desired to be estimated from translation result obtained using a translation system; generating error-corrected word candidates for respective estimated translation errors on a basis of an analysis of original text by the translation system; and selecting a final corrected word from among the error-corrected word candidates using the error-specific language model suitable for the type of error desired to be corrected, and incorporating the final corrected word in the translation result, thus correcting the translation errors.

According to embodiments of the present invention, there is an advantage in that translation errors of a machine translation system such as asyntactic or unnatural expressions can be corrected in real time, thus improving the translation performance of the machine translation system.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects and features of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram showing the construction of an error-specific language model builder according to an embodiment of the present invention;

FIG. 2 is a flowchart showing a process for building an error-specific language model according to an embodiment of the present invention;

FIG. 3 is a block diagram showing the construction of a post-editing apparatus for correcting translation errors according to an embodiment of the present invention; and

FIG. 4 is a flowchart showing the operating process of the post-editing apparatus for correcting translation errors according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description of the present invention, if detailed descriptions of related well-known constructions or functions are determined to make the gist of the present invention unclear, the detailed descriptions will be omitted. The following terms are defined considering their functions in the present invention. Since the meanings of the terms may vary according to a user's or an operator's intention or usual practice, the meanings of the terms must be interpreted based on the overall context of the present specification.

The present invention is intended to automatically detect errors in the final translation generated by a machine translation system and correct the errors to provide an accurate translation. The present invention is configured such that, after the final translation is obtained from given data, a post-processing apparatus for correcting translation errors searches for mistranslations, designates priorities for error correction in conformity with the characteristics of the translation errors, and sequentially corrects the translation errors according to the priorities by using language models specified for the types of translation errors, thus improving translation quality.

There are various types of translation errors in the machine translation system and they may be classified using a variety of methods. Among these classification methods, when English is used as the target language, translation errors may be classified as follows:

1) Word choice error: an error in the selection of a word for translation such as a noun, a verb, an adjective, an adverb, an article, a preposition, and an auxiliary verb, an error in singular/plural agreement, and an error in determining whether singular or plural;

2) Word presence error: an error related to the presence of an article, a preposition, an auxiliary verb or adjective, and so forth; and

3) Word order error: an error in adjective sequence, the order in nominal compound, etc.

Here, word choice error refers to an error occurring in the case where the translation engine of the machine translation system generates wrong words. Word presence error refers to an error occurring in the case where words such as an article or a preposition are not present at necessary locations or are present at unnecessary locations. Word order error refers to an error occurring in the case where, when a word is modified by various adjectives or various adverbs, the order of these modifiers is incorrect, or where the order of nouns in nominal compound is incorrect.

An n-gram language model-based error correction scheme corrects errors based on whether word sequences in the translation appear in a corpus, and the basic idea thereof is described as follows. In a Korean/English machine translation system, if an English sentence “I went to the school” is obtained as the translation of a Korean source sentence, a word sequence whose frequency is equal to or less than a threshold, among the following 3-gram data, is detected as an error in a 3-gram error correction model. The following examples are simple examples of 3-gram data. The left side denotes word sequences and the right side denotes the frequencies of the respective word sequences appearing in the corpus. Actual data may have a form different from that of the examples, i.e., corrected data values rather than simple appearance frequencies.

$_I_went 200
I_went_to 100
went_to_the 120
to_the_school 15
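The threshold check described above can be sketched in Python as follows. This is only a minimal illustration: the function name `find_suspect_trigrams` and the threshold value of 20 are assumptions introduced for the example, and the counts are the four sample frequencies listed above rather than real corpus data.

```python
from typing import Dict, List, Tuple

# Frequencies copied from the sample 3-gram data above.
TRIGRAM_COUNTS: Dict[str, int] = {
    "$_I_went": 200,
    "I_went_to": 100,
    "went_to_the": 120,
    "to_the_school": 15,
}


def find_suspect_trigrams(sentence: str,
                          counts: Dict[str, int],
                          threshold: int) -> List[Tuple[str, int]]:
    """Return the 3-grams of `sentence` whose frequency is <= threshold."""
    # "$" marks the sentence start, as in the sample data.
    tokens = ["$"] + sentence.split()
    suspects = []
    for i in range(len(tokens) - 2):
        key = "_".join(tokens[i:i + 3])
        freq = counts.get(key, 0)  # unseen sequences are treated as frequency 0
        if freq <= threshold:
            suspects.append((key, freq))
    return suspects


# The threshold of 20 is a hypothetical value chosen for illustration.
suspects = find_suspect_trigrams("I went to the school", TRIGRAM_COUNTS, threshold=20)
print(suspects)
```

With this assumed threshold, only the sequence to_the_school falls at or below it, so that portion of the translation would be flagged as a possible error.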

The n-gram data is made based on information about the frequencies of corresponding word sequences in the corpus. However, when n-gram data is merely made based on the information about the frequencies of word sequences, there is a high probability that data will be insufficient or that inappropriate word sequences, having no meaning as n-grams, will be made.

Because of this concern, a method of extracting n-gram data using structure analysis is employed. When n-gram data is extracted from a dependency tree obtained by analyzing each dependency, there is an advantage of acquiring information about word sequences having long-distance dependency.

However, even in this case with the analyzed dependency, the language model-based post-editing method does not perform well. Therefore, a post-editing scheme for correcting translation errors, which is based on error-specific language models, is proposed in the embodiments of the present invention.

Embodiments

FIG. 1 is a conceptual diagram showing the building of an error-specific language model according to an embodiment of the present invention.

Referring to FIG. 1, an error-specific language model builder 100 receives a target language corpus as a training corpus to build a language model, thereby generating an error-specific language model 110 including a word choice error language model 112, a word order error language model 114 and a word presence error language model 116 suitable for the correction of word choice error, word order error and word presence error, respectively, through the target language corpus.

FIG. 2 is a flowchart showing a process for building an error-specific language model according to an embodiment of the present invention.

Referring to FIG. 2, after receiving a target language corpus, the error-specific language model builder 100 builds a language model in a form suitable for correcting errors, in which the language model is based on dependency grammar. In order to build language models suitable for respective error types, the language models are built by defining factors required for the correction of relevant errors for the respective error types. Accordingly, even when language models are built for a given sentence from the same dependency tree, the language model is built differently according to the type of error. First, after the target language corpus is received, dependencies in the sentences contained in the target language corpus, i.e., a learning corpus, are analyzed to build a language model (step 200). Next, factors, to be described later, of respective words having dependency with a specific word are extracted to correct word errors for respective error types (word choice, word presence, word order) (step 202). The final language model is built by a smoothing process based on frequency information about the words included in the extracted factors (step 204).
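Steps 200 to 204 can be sketched in Python as follows, under several stated assumptions. The dependency analysis of step 200 is assumed to be done by an external parser, so the input is already a list of (word, head word) pairs. The `factor` argument is a hypothetical hook standing for the factor definition of each error type, and add-alpha smoothing is used only as one possible smoothing process, since the description does not name a specific one.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# One parsed sentence: (word, head word) pairs. Producing these pairs is the
# dependency analysis of step 200, assumed here to be done by an external parser.
ParsedSentence = List[Tuple[str, str]]


def build_model(corpus: Iterable[ParsedSentence],
                factor: Callable[[str], str],
                alpha: float = 1.0) -> Callable[[str, str], float]:
    """Return a function prob(word, head) estimated from the parsed corpus."""
    pair_counts: Counter = Counter()
    head_counts: Counter = Counter()
    vocabulary = set()
    for sentence in corpus:
        for word, head in sentence:
            w, h = factor(word), factor(head)  # step 202: extract factors
            pair_counts[(w, h)] += 1
            head_counts[h] += 1
            vocabulary.add(w)
    size = max(len(vocabulary), 1)  # avoid division by zero on an empty corpus

    def prob(word: str, head: str) -> float:
        # step 204: add-alpha smoothing (an assumed choice, not from the text)
        w, h = factor(word), factor(head)
        return (pair_counts[(w, h)] + alpha) / (head_counts[h] + alpha * size)

    return prob


# Toy corpus for illustration only; lower-casing stands in for a real factor.
toy_corpus = [
    [("I", "went"), ("school", "went")],
    [("He", "went"), ("school", "went")],
]
prob = build_model(toy_corpus, factor=str.lower)
p_seen = prob("School", "went")    # seen twice with this head
p_unseen = prob("banana", "went")  # never seen with this head
```

Passing a different `factor` function for each error type is one way to obtain differently built models from the same dependency-analyzed corpus, as the paragraph above describes.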

In order to build language models specified for respective error types, each word constituting a sentence may be defined as k factors. Here, a word $w$ is composed of $k$ factors $f^{1}, f^{2}, \ldots, f^{k}$, and may be represented by the following Equation (1).

$$w \equiv \{f^{1}, f^{2}, \ldots, f^{k}\} = f^{1:k} \qquad (1)$$

In this case, the probability that a word $w_i$ having dependency information $d_1, d_2, \ldots, d_{n-1}$ as context information is a correct word, that is, $P(w_i \mid (d_1, d_2, \ldots, d_{n-1}))$, is represented by the following Equation (2),

$$P(w_i \mid (d_1, d_2, \ldots, d_{n-1})) = P(f_i^{1:k} \mid (f_{d_1}^{1:k}, f_{d_2}^{1:k}, \ldots, f_{d_{n-1}}^{1:k})) \qquad (2)$$

where $f_i^{1:k}$ denotes the factors of $w_i$, and $f_{d_j}^{1:k}$ denotes the factors of a word $d_j$ having dependency with $w_i$.

In order to build a language model for the correction of word choice error, only basic forms of words are defined as factors, and the word $w$ is defined by the following Equation (3).

$$w \equiv \{f^{1} = f^{s}\ (\text{basic form of word})\} \qquad (3)$$

This is due to the assumption that, when $w$ is a content word, it is possible to determine word choice error using only the basic forms of related surrounding content words. Therefore, a language model for the correction of content word choice error is given by the following Equation (4).

$$P_{cw}(w_i \mid (d_1, d_2, \ldots, d_{n-1})) = P_{cw}(f_i^{s} \mid (f_{d_1}^{s}, f_{d_2}^{s}, \ldots, f_{d_{n-1}}^{s})) \qquad (4)$$

That is, the language model is built by extracting, from the dependency-analyzed target language corpus, frequency information about the basic forms of content words that are in a dependency relation.

For the correction of word presence error, if it is assumed that the factors required by the language model are information about word sequences of all words under dependency, the building of the language model is performed by extracting word sequence information of all words under dependency with respect to a specific word from the dependency-analyzed learning corpus. In a similar manner, the building of a language model for the correction of word order error is performed by extracting word sequence information of all words having dependency with the current target word.
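The extraction of word sequence information under dependency can be illustrated with the following Python sketch. The token representation (each word paired with the index of its head) and the function name `dependency_sequences` are assumptions made for the example, and the dependency analysis of the sample sentence is hand-written rather than produced by a parser.

```python
from typing import Dict, List, Tuple

# (word, index of the head token); the root has head index -1.
Token = Tuple[str, int]


def dependency_sequences(tokens: List[Token]) -> List[Tuple[str, ...]]:
    """For each word that has dependents, return that word together with
    its dependents, arranged in their surface (sentence) order."""
    groups: Dict[int, List[int]] = {}
    for idx, (_, head) in enumerate(tokens):
        if head >= 0:
            groups.setdefault(head, [head]).append(idx)
    return [
        tuple(tokens[i][0] for i in sorted(members))
        for _, members in sorted(groups.items())
    ]


# Hand-written analysis of "I went to the school" (illustrative only):
# I -> went, went = root, to -> went, the -> school, school -> to
sentence = [("I", 1), ("went", -1), ("to", 1), ("the", 4), ("school", 2)]
sequences = dependency_sequences(sentence)
print(sequences)
```

In this toy analysis the pair ("to", "school") is extracted even though the two words are not adjacent in the sentence, which is the kind of long-distance information a plain n-gram window would miss.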

FIG. 3 is a block diagram showing the construction of a post-editing apparatus for correcting translation errors according to an embodiment of the present invention.

Referring to FIG. 3, a post-editing apparatus 300 for correcting translation errors includes an error search unit 302, a corrected word candidate generator 304, and a corrected word selector 306.

The error search unit 302 searches for mistranslations in the translation result obtained using the machine translation system. At this time, the error search unit 302 searches for or estimates the translation errors on the basis of the error-specific language model 110 generated by the error-specific language model builder 100, and determines the order in which the estimated errors are to be corrected.

Specifically, the error search unit 302, which is configured to correct word choice error and word presence error, estimates the probabilities for the respective errors using the probability models preset for them, and regards an estimated error as an actual error when its estimated probability is equal to or less than a threshold value.

Further, after the error search unit 302 estimates the errors, it orders the estimated errors by correction priority. The priorities are designated as follows.

1) Content words have higher correction priority than function words, and, among the content words, a word having a higher probability of the occurrence of errors according to an error estimation model has higher correction priority.

2) A modifier has higher correction priority than the word it modifies, and, among the modifiers, a word having a higher probability of the occurrence of errors according to the error estimation model has higher correction priority.

3) Word choice error has higher correction priority than word order error.
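The three priority rules above can be sketched in Python as a sort key. How the rules combine when they conflict is not stated in the description, so the lexicographic order used here (content word first, then modifier, then error type, then error probability) is an assumption, as are the class `EstimatedError`, its field names, and the position given to word presence error.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EstimatedError:
    word: str
    error_type: str            # "word_choice", "word_order" or "word_presence"
    is_content_word: bool
    is_modifier: bool
    error_probability: float   # higher means more likely to be an error


# Rule 3: word choice before word order. Placing word presence last is an
# assumption; the description does not rank it.
TYPE_RANK = {"word_choice": 0, "word_order": 1, "word_presence": 2}


def correction_order(errors: List[EstimatedError]) -> List[EstimatedError]:
    """Sort estimated errors so that the first element is corrected first."""
    return sorted(
        errors,
        key=lambda e: (
            not e.is_content_word,     # rule 1: content words first
            not e.is_modifier,         # rule 2: modifiers first
            TYPE_RANK[e.error_type],   # rule 3: word choice before word order
            -e.error_probability,      # ties: more probable errors first
        ),
    )


# Toy input for illustration only.
errors = [
    EstimatedError("the", "word_presence", False, True, 0.9),
    EstimatedError("school", "word_choice", True, False, 0.6),
    EstimatedError("old", "word_order", True, True, 0.5),
    EstimatedError("big", "word_choice", True, True, 0.7),
]
ordered_words = [e.word for e in correction_order(errors)]
print(ordered_words)
```

A tuple key keeps each rule visible as one component, so a different combination of the rules can be tried by reordering the components.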

Once the error search unit 302 has estimated the errors, the corrected word candidate generator 304 sequentially generates error-corrected word candidates for the estimated translation errors, in order of priority, on the basis of information about the original text analyzed by the translation engine of the machine translation system.

Further, in order to generate error-corrected word candidates, other translation candidates are retrieved using results analyzed by the machine translation system, dictionary information, and so forth. In the case of word choice error, e.g., for English/Korean translation, information about other translated word candidates is retrieved based on the Korean dictionary information, thus generating the word candidates. In the case of word order error, error-corrected word candidates are generated by permuting the order of relevant words.
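The two candidate generation strategies described above can be sketched in Python as follows. The function names and the small dictionary are hypothetical and serve only as an illustration; a real system would draw on the translation engine's own analysis results and dictionary.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def word_choice_candidates(source_word: str,
                           current_translation: str,
                           dictionary: Dict[str, List[str]]) -> List[str]:
    """Other translations of the source word, excluding the one already used."""
    return [t for t in dictionary.get(source_word, []) if t != current_translation]


def word_order_candidates(words: Sequence[str]) -> List[Tuple[str, ...]]:
    """All reorderings of the given words except the current order."""
    original = tuple(words)
    # dict.fromkeys removes duplicate orderings when a word is repeated.
    unique = dict.fromkeys(permutations(original))
    return [p for p in unique if p != original]


# Toy dictionary entry for illustration only, not real dictionary data.
toy_dictionary = {"학교": ["school", "college"]}

choice = word_choice_candidates("학교", "school", toy_dictionary)
order = word_order_candidates(["red", "big", "car"])
print(choice)
print(len(order))
```

The number of permutations grows factorially with the number of words, so a practical generator would presumably restrict permutation to a short span such as the modifiers of a single head word.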

Moreover, after receiving the error-corrected word candidates generated by the corrected word candidate generator 304, the corrected word selector 306 calculates, for the generated candidates of the actual erroneous sentence, the probabilities based on the error-specific language model 110. A word having the highest probability exceeding a threshold among the calculated probabilities is selected as a corrected word.
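The selection rule above can be sketched in Python as follows. The scoring function stands in for the error-specific language model 110 and is backed here by made-up probabilities. Returning `None` when no candidate exceeds the threshold, so that the original word is left unchanged, is an assumption, since the description does not state what happens in that case.

```python
from typing import Callable, Optional, Sequence


def select_correction(candidates: Sequence[str],
                      score: Callable[[str], float],
                      threshold: float) -> Optional[str]:
    """Return the candidate with the highest probability above the threshold,
    or None if no candidate exceeds it."""
    best: Optional[str] = None
    best_probability = threshold
    for candidate in candidates:
        probability = score(candidate)
        if probability > best_probability:
            best, best_probability = candidate, probability
    return best


# Made-up probabilities standing in for the language model (illustration only).
toy_probabilities = {"go": 0.02, "went": 0.35, "gone": 0.10}


def toy_score(word: str) -> float:
    return toy_probabilities.get(word, 0.0)


selected = select_correction(["go", "went", "gone"], toy_score, threshold=0.05)
rejected = select_correction(["go", "went", "gone"], toy_score, threshold=0.5)
print(selected, rejected)
```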

FIG. 4 is a flowchart showing the operating process of a post-editing apparatus for correcting translation errors according to an embodiment of the present invention.

Referring to FIG. 4, translation errors are estimated by using the error-specific language model from the translation result of the translation system, and the estimated translation errors are ordered by priority (step 400).

Next, after the estimated translation errors ordered by the respective priorities are received, the corrected word candidates are sequentially generated in accordance with the priorities (step 402). At this time, other translation candidates may be retrieved using analysis of the translation system, dictionary information, and so forth.

A final corrected word is selected from among the error-corrected word candidates using the error-specific language model suitable for the type of error (step 404).

The errors are corrected by incorporating the selected final corrected word into the translation result data (step 406).

As described above, the embodiments of the present invention are intended to automatically detect errors in the final translation generated by the machine translation system and correct the errors to provide an accurate translation. Further, the embodiments are configured such that, after the machine translation system translates given data, a post-processing apparatus for correcting translation errors searches for a portion in which mistranslation has occurred, designates priorities for error correction in conformity with the characteristics of the found translation errors and sequentially corrects the translation errors according to the priorities by using language models specified for the types of translation errors, thus improving translation quality.

While the invention has been shown and described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims. Therefore, the scope of the present invention is not limited to the above-described embodiments and should be defined by the claims and equivalents thereof.

1. A post-editing apparatus for correcting translation errors, comprising: a translation error search unit for estimating translation errors using an error-specific language model suitable for a type of error desired to be estimated from translation result obtained using a translation system, and determining an order of correction of the translation errors; a corrected word candidate generator for sequentially generating error-corrected word candidates for respective estimated translation errors on a basis of analysis of an original text of the translation system; and a corrected word selector for selecting a final corrected word from among the error-corrected word candidates by using the error-specific language model suitable for the type of error desired to be corrected, and incorporating the final corrected word in the translation result, thus correcting the translation errors.
2. The post-editing apparatus of claim 1, wherein the error-specific language model is built from a target language corpus in a form specified for types of translation errors.
3. The post-editing apparatus of claim 2, wherein the error-specific language model is built in such a way that factors required for correction of errors for respective error types are separately defined for word choice error, word order error and word presence error, based on the corpus for which dependency was analyzed.
4. The post-editing apparatus of claim 3, wherein the word choice error is at least one of an error in selection of a translated word such as a noun, a verb, an adjective, an adverb, an article, a preposition, and an auxiliary verb, an error in singular/plural agreement, and an error in determining whether singular or plural.
5. The post-editing apparatus of claim 3, wherein the word order error is an error in an order of adjective sequence and an order in nominal compound.
6. The post-editing apparatus of claim 3, wherein the word presence error is an error related to presence of an article, a preposition and an auxiliary verb or adjective.
7. The post-editing apparatus of claim 1, wherein the translation error search unit sets priorities for correction of found translation errors in conformity with error correction priority determination rules.
8. The post-editing apparatus of claim 7, wherein the error correction priority determination rules are made in such a way that content words have higher correction priority than function words, modifier has higher correction priority than modified, and the word choice error has higher correction priority than word order error, and that, among the content words and among the modifiers, words having a high probability of errors have higher correction priority.
9. The post-editing apparatus of claim 7, wherein the corrected word candidate generator sequentially corrects the errors based on the error correction priorities set by the translation error search unit.
10. The post-editing apparatus of claim 1, wherein the corrected word selector calculates probabilities of sentences in which erroneous word in each erroneous sentence is replaced with relevant error-corrected word candidate by using the error-specific language model, and selects a word having a highest probability, from among the error-corrected word candidates, as a corrected word.
11. A post-editing method for correcting translation errors, comprising: estimating translation errors using an error-specific language model suitable for a type of error desired to be estimated from translation result obtained using a translation system; generating error-corrected word candidates for respective estimated translation errors on a basis of an analysis of original text by the translation system; and selecting a final corrected word from among the error-corrected word candidates using the error-specific language model suitable for the type of error desired to be corrected, and incorporating the final corrected word in the translation result, thus correcting the translation errors.
12. The post-editing method of claim 11, wherein the error-specific language model is built from a target language corpus in a form specified for types of translation errors.
13. The post-editing method of claim 12, wherein the error-specific language model is built in such a way that factors required for correction of errors for respective error types are separately defined for word choice error, word order error and word presence error, based on the corpus for which dependencies were analyzed.
14. The post-editing method of claim 13, wherein the word choice error is at least one of an error in selection of a translated word such as a noun, a verb, an adjective, an adverb, an article, a preposition, and an auxiliary verb, an error in singular/plural agreement, and an error in determining whether singular or plural.
15. The post-editing method of claim 13, wherein the word order error is an error in an order of adjective sequence and an order in nominal compound.
16. The post-editing method of claim 13, wherein the word presence error is an error related to presence of an article, a preposition and an auxiliary verb or adjective.
17. The post-editing method of claim 11, wherein the estimating errors is performed to set priorities for correction of found translation errors in conformity with error correction priority determination rules.
18. The post-editing method of claim 17, wherein the error correction priority determination rules are made in such a way that content words have higher correction priority than function words, modifier has higher correction priority than modified, and the word choice error has higher correction priority than word order error, and that, among the content words and among the modifiers, words having a high probability of errors have higher correction priority.
19. The post-editing method of claim 17, wherein said generating the error-corrected word candidates is performed to sequentially correct the errors based on the set error correction priorities.
20. The post-editing method of claim 11, wherein said correcting the errors includes: calculating probabilities of sentences in which an erroneous word in each erroneous sentence is replaced with relevant error-corrected word candidate by using the error-specific language model; and selecting a word having a highest probability, from among the corrected word candidates, as a corrected word.